Example: The following code will take you through your first Kaggle competition submission: House Prices: Advanced Regression Techniques.
library(tidyverse)
# 1. Load in training and test data
train <- read_csv("https://rudeboybert.github.io/SDS293/static/train.csv")
test <- read_csv("https://rudeboybert.github.io/SDS293/static/test.csv")
sample_submission <- read_csv("https://rudeboybert.github.io/SDS293/static/sample_submission.csv")
# 2.a) Exploratory data analysis! EDA! Look at your data!
# View(train)
# View(test)
# View(sample_submission)
# 2.b) Compute a single predicted value for all houses: the mean sale price of
# all houses in training set
y_hat_1 <- mean(train$SalePrice)
y_hat_1
## [1] 180921.2
# 3.a) Exploratory data analysis! EDA! Visualizations
# Distribution of numerical outcome variable
ggplot(train, aes(x = SalePrice)) +
  geom_histogram() +
  labs(x = "Sale price in dollars") +
  geom_vline(xintercept = y_hat_1, col = "red")
# 3.b) Distribution of numerical outcome variable on a log10-scale
ggplot(train, aes(x = SalePrice)) +
  geom_histogram() +
  scale_x_log10() +
  labs(x = "Sale price in dollars (log10-scale)")
# 4. Apply fitted model to get predictions for test data. Note format of
# submission data frame must match sample_submission.
submission <- test %>%
  select(Id) %>%
  mutate(SalePrice = y_hat_1)
# Look at your data!
# View(submission)
# 5. Output predictions to CSV for submission to Kaggle
write_csv(submission, "submission.csv")

Exercise: You will now make a submission to Kaggle using a linear regression model. Do the following:
SalePrice.

Hint: Here is code that will allow you to fit a linear regression model to a training set and then apply it to the test set to get predictions:
# Split mtcars data frame into two parts
mtcars_train <- mtcars %>%
  slice(1:16)
mtcars_test <- mtcars %>%
  slice(17:32)
# 1) Fit regression model to training data
model_2 <- lm(mpg ~ hp, data = mtcars_train)
# 2) Get predictions for test data. Note:
# - The output here is a vector
# - We'll see later on that there is more than one way to get predictions from a fitted model
y_hat_2 <- predict(model_2, newdata = mtcars_test)

Solutions:
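As a rough sketch (not the official solution), the vector returned by `predict()` can be attached to the test rows with `mutate()`, mirroring step 4 of the example. The column name `mpg_hat` and the object `predictions_sketch` are names chosen here purely for illustration.

```r
library(dplyr)

# Same split and model as in the hint
mtcars_train <- mtcars %>%
  slice(1:16)
mtcars_test <- mtcars %>%
  slice(17:32)
model_2 <- lm(mpg ~ hp, data = mtcars_train)

# predict() returns one value per row of newdata
y_hat_2 <- predict(model_2, newdata = mtcars_test)

# Attach the predictions as a new column; unname() drops any names
# carried by the prediction vector. mpg_hat is a hypothetical column name.
predictions_sketch <- mtcars_test %>%
  mutate(mpg_hat = unname(y_hat_2)) %>%
  select(hp, mpg, mpg_hat)
```

Unlike step 4, where a single value was recycled for every house, the prediction vector here must have exactly one element per test row for `mutate()` to accept it.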
The iris dataset is a canonical dataset in statistics and machine learning (Wikipedia). It was introduced by Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. It records 5 variables for each of 150 iris flowers.
The categorical outcome variable \(y\) is one of three species of iris flower:
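These dimensions can be checked directly, since `iris` ships with base R. The following is a minimal sketch; the object names `iris_dims` and `species_counts` are chosen here for illustration.

```r
# iris is built into R, so no packages or downloads are needed
iris_dims <- dim(iris)                 # number of rows and columns
species_counts <- table(iris$Species)  # number of flowers per species

iris_dims
species_counts
```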
| Setosa | Versicolor | Virginica |
|---|---|---|